-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First big step towards solving issues 2217 and 2465 #3037
Conversation
8c1fca4
to
bb7765a
Compare
bb7765a
to
2ca6e7f
Compare
TestAgent looks better now. Still one test failing. Need to fix the |
c66eaa5
to
5fb0ce5
Compare
Tests are fixed and the agent tests now run fully in parallel. Going from 108 sec locally down to 15 ... |
5fb0ce5
to
c0e249d
Compare
|
5bae6bc
to
138d16f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted a few small things but overall looks good. I like the approach, lack of callbacks and associated de-nesting, and elimination of several test frameworks :-)
command/agent/agent.go
Outdated
a.wgServers.Add(1) | ||
go func() { | ||
defer a.wgServers.Done() | ||
up.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this ensure that Serve()
actually got called? It seems like this can still bail out too early since it's before the actual Serve()
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not guarantee it. Until the Go team fixes the bug I think this is the best we can do with the minimal amount of overhead. This guarantees that the go routine that executes Serve
gets at least scheduled once before Start
returns. My understanding is that it yields again on the next blocking call which should happen within Serve
. I don't see how it would yield between up.Done()
and Serve
until I got it wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough - I'm not sure how you'd do better than this given the current situation w/o polling an endpoint, which would be truly horrible to do.
command/agent/agent.go
Outdated
srv.Shutdown() | ||
} | ||
for _, srv := range a.httpServers { | ||
a.logger.Println("[INFO] agent: Stopping HTTP endpoint", srv.Addr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be plural "endpoints" too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, since it now logs one line per endpoint with the specific address of the endpoint.
command/agent/agent.go
Outdated
err := a.delegate.Shutdown() | ||
a.logger.Print("[INFO] agent: delegate down") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message probably won't make sense to users. We could probably remove this, or move it to debug level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True
command/agent/agent_endpoint_test.go
Outdated
@@ -285,7 +277,10 @@ func TestAgent_Reload(t *testing.T) { | |||
}() | |||
|
|||
retry.Run(t, func(r *retry.R) { | |||
if got, want := len(cmd.httpServers), 1; got != want { | |||
if cmd == nil || cmd.agent == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think cmd
can be nil
here. Would this fail the race detector on cmd.agent
since it looks like the cmd.Run()
in a separate goroutine sets it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I check.
command/agent/dns_test.go
Outdated
in, _, err := c.Exchange(m, addr.String()) | ||
if err != nil { | ||
t.Fatalf("err: %v", err) | ||
} | ||
|
||
// Only 1 is passing, so we should only get 1 answer | ||
for _, a := range in.Answer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like leftover debug code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a look
command/agent/listen.go
Outdated
} | ||
|
||
func ListenUnix(addr string, perm FilePermissions) (net.Listener, error) { | ||
// todo(fs): move this somewhere else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This did get moved to startHTTP()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will remove
) | ||
|
||
func init() { | ||
rand.Seed(time.Now().UnixNano()) // seed random number generator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've got https://github.com/hashicorp/consul/tree/master/vendor/github.com/sean-/seed vendored and can use that, though this is probably cheaper and fine for unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need a lib for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's shared across projects so there's some justification, though it could be paste-able pretty easily. There's some nuance in doing it correctly (doesn't matter for unit tests), so that probably tips in favor vs. pasting.
// during creation or startup instead of returning errors. It manages a | ||
// temporary data directory which is removed after shutdown. | ||
// | ||
// todo(fs): do we need the temp data dir if we run in dev mode? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't, though when I tried to use dev mode here I got some failures (some of the tests use features that don't work in dev mode). We will need to be able to tell the test agent whether or not to use dev mode and default that to true
most likely, and opt-out in a few places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like something for a different PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can check what the impact is but since we're not using all CPUs on the agent test yet I think there is more potential somewhere else. I got at most half of the cores used. With go test --parallel 24
test execution went from 15 to 11 sec so not such a huge gain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's probably better coverage-wise to activate non-dev as well. Dev mode bypasses some key places that are worth kicking during our unit tests.
command/agent/testagent.go
Outdated
} | ||
|
||
// BoolTrue and BoolFalse exist to create a *bool value. | ||
var BoolTrue = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these be const
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, since you can't create a reference to a const. Tried that, too :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes I miss (a few small parts of) C++ (every once in a while).
command/util_test.go
Outdated
|
||
func nextConfig() *agent.Config { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we kill this stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm working on that right now. That and then the code in command/server_test.go
fd0843e
to
2191f6a
Compare
2e8f77d
to
f0401ee
Compare
rebased to master to pick up Azure RetryJoin and latest go-sockaddr |
3cbe801
to
f673590
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted some small things but otherwise looks good to merge!
command/agent/acl_test.go
Outdated
if err == nil || !strings.Contains(err.Error(), rootDenied) { | ||
t.Fatalf("err: %v", err) | ||
} | ||
|
||
// The ACL master token should also not call the server, but should give | ||
// us a working agent token. | ||
acl, err := agent.resolveToken("towel") | ||
// us a working a token. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should still be "agent".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
search and replace. You've got eagle eyes.
command/agent/agent.go
Outdated
@@ -75,16 +78,20 @@ type clientServer interface { | |||
// mode, it runs a full Consul server. In client-only mode, it only forwards | |||
// requests to other Consul servers. | |||
type Agent struct { | |||
// id is an optional log prefix. | |||
id string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be confusing w/the existing node ID. Maybe we can call this logPrefix
so it's more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes more sense to just provide a properly configured logger which is used everywhere. Since I've fixed the logger passing we can move this into the TestAgent
command/agent/agent.go
Outdated
} | ||
|
||
// wait for servers to be up | ||
// todo(fs): not sure whether this is the right approach. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty reasonable - if the server doesn't start then we should hit the timeout case and abort the startup, which I think is the minimum we should do. Failing the agent later if a server error occurs seems good for robustness, but should be done separately in a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(that'll let the process supervisor try to clean it up)
command/agent/agent.go
Outdated
break | ||
} | ||
l, err = tls.Listen("tcp", p.Addr, tlscfg) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a default
here so we don't just panic if we mis-configure the addrs
in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
command/agent/agent.go
Outdated
// Shutdown is called before the Serve go routine was scheduled then | ||
// the Serve go routine never returns. This deadlocks the agent | ||
// shutdown for some tests since it will wait forever. | ||
if strings.Contains("*tls.listener", fmt.Sprintf("%T", l)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this pattern vs. a type assertion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type assertions don't work with non-exported types :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update the comment to make that clear since you're not the first one to ask.
// srv.Close() | ||
|
||
// graceful shutdown | ||
ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to log anything if we timeout here? Right now I don't think we'd know if we were hitting this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do
command/agent/config.go
Outdated
// connections to consul. | ||
func (c *Config) IncomingTLSConfig() (*tls.Config, error) { | ||
tc := &tlsutil.Config{ | ||
VerifyIncoming: c.VerifyIncoming || c.VerifyIncomingHTTPS, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is only set up properly for HTTPS listeners (not RPC listeners), so we should rename it to be more specific and update the comment.
time.Sleep(50 * time.Millisecond) | ||
p := &UserEvent{Name: "second"} | ||
if err := a.UserEvent("dc1", "root", p); err != nil { | ||
t.Fatalf("err: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't work inside a goroutine, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which part? the time.Sleep
or the t.Fatal
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The t.Fatal
} | ||
list = ln | ||
} else { | ||
list = tcpKeepAliveListener{ln.(*net.TCPListener)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this detail got lost in the refactor, was that intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the tcpKeepAliveListener
would be used by default in the http
package - which it is but only if you use the ListenAndServe
methods - which we don't. I'll add it back in. Forgot to do that.
command/exec_test.go
Outdated
waitForLeader(t, a1.httpAddr) | ||
t.Parallel() | ||
cfg := agent.TestConfig() | ||
cfg.DisableRemoteExec = agent.Bool(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be &BoolFalse
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
* use agent logger for consul/serf/raft/dns/agent/... * support optional id for concurrent tests
* don't use retry to try restarting the agent this caused some issues when the startup would fail in a separate go routine * clear out the data directory on every retry since the ports are stored in the raft data files * set a unique id for every agent to allow for tracking of concurrent output
* capture all output to test.log only and print only failures * always print output on travis * test github.com/hashicorp/consul/consul separately until we've found the interactions with the other packages during testing.
We need to init the flagset that cli uses to generate the help outside of the Run method since Run isn't called anymore for printing help.
* refactor DNS server to be ready for multiple bind addresses * drop tcpKeepAliveListener since it is default for the HTTP servers * add startup timeout watcher for HTTP servers identical to DNS server
f673590
to
ec72969
Compare
command/exec_test.go
Outdated
@@ -31,7 +31,7 @@ func TestExecCommand_implements(t *testing.T) { | |||
func TestExecCommandRun(t *testing.T) { | |||
t.Parallel() | |||
cfg := agent.TestConfig() | |||
cfg.DisableRemoteExec = agent.Bool(false) | |||
cfg.DisableRemoteExec = &agent.BoolFalse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were a bunch more of these in this file (I didn't tag them all) :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll revert to the Bool()
function since it is more flexible in situations where you need to derive the pointer from another value. The main issue is the use of *bool
in the first place IMO.
select { | ||
case <-done: | ||
// server down within timeout | ||
case <-ctx.Done(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can just check the error from ctx.Done()
here, this is correct but you don't need the extra channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and a good idea as well since we're timing out every time. The server doesn't shut down cleanly and I need to figure out why. I've refactored it to not use the second channel and will investigate the code that hangs the server during shutdown.
command/agent/testagent.go
Outdated
agent.LogWriter = a.LogWriter | ||
agent.logger = log.New(logOutput, id, log.LstdFlags) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the log prefix anymore in test.log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm they are there now that I'm looking. Perhaps just some of the tests are missing the prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only tests that use the TestAgent
set the prefix.
1yhud3gtn56jf 2017/05/31 09:42:26 [INFO] raft: Node at 127.0.0.1:50636 [Candidate] entering Candidate state in term 2
1yhud3gtn56jf 2017/05/31 09:42:26 [DEBUG] raft: Votes needed: 1
1yhud3gtn56jf 2017/05/31 09:42:26 [DEBUG] raft: Vote granted from 127.0.0.1:50636 in term 2. Tally: 1
1yhud3gtn56jf 2017/05/31 09:42:26 [INFO] raft: Election won. Tally: 1
1yhud3gtn56jf 2017/05/31 09:42:26 [INFO] raft: Node at 127.0.0.1:50636 [Leader] entering Leader state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll use the test name as an id as you suggested earlier. That makes more sense even if the logs are a bit misaligned. grep TestName test.log
will then show you everything you need to know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ grep TestDNS_Recurse_Truncation test.log
=== RUN TestDNS_Recurse_Truncation
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:127.0.0.1:54916 Address:127.0.0.1:54916}]
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] raft: Node at 127.0.0.1:54916 [Follower] entering Follower state (Leader: "")
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] serf: EventMemberJoin: Node 77c888a2-7a88-a600-22a6-33064aa14734 127.0.0.1
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] consul: Adding LAN server Node 77c888a2-7a88-a600-22a6-33064aa14734 (Addr: tcp/127.0.0.1:54916) (DC: dc1)
TestDNS_Recurse_Truncation - 2017/05/31 09:51:15 [INFO] serf: EventMemberJoin: Node 77c888a2-7a88-a600-22a6-33064aa14734.dc1 127.0.0.1
pushed changes to master |
not done yet